How to Fine-Tune an LLM Part 2: Instruction Tuning Llama 2
In part 1, we prepped our dataset. In part 2, we train our model.
Last Updated: Nov 24, 2023
In our previous article on datasets for instruction tuning, we explored how to create an instruction dataset for a Llama 2 model. In this article, we'll fine-tune it using the Alpaca dataset we previously prepared.
This codebase and article aim to be pedagogical and straightforward. The main goal here is to understand what is happening under the hood when you fine-tune an LLM for instruction tuning.
There are more sophisticated training recipes out there, like the Hugging Face transformers' Trainer, trl, Axolotl, Peft, llama_recipes, the alignment_handbook, etc. In this article, we will try our best to make it as simple as possible and make the training loop straightforward to follow.
A llama and an alpaca meeting again (the left one looks more like a guanaco to me, but that might be a personal thing)
What We'll Be Covering:
Downloading the Preprocessed Dataset from W&B
Loading Local JSON Data from Disk Using HuggingFace Datasets
DataLoader
Training Loop
Freezing the Model to Save Memory: 🥶 Jeremy Howard Style
Optimizer and Scheduler
Sampling from the Model
Validation Step
A Simple PyTorch Training Loop for Your LLM
Results
GPT-4 based evaluation
Evaluation Results
Conclusion and Final Remarks
Downloading the Preprocessed Dataset from W&B
Let's get started. In the previous article, we saved our preprocessed dataset as a Weights & Biases Artifact, so we can easily retrieve the dataset from there. Here's the code:
```python
import wandb
from pathlib import Path

run = wandb.init(project="alpaca_ft")
artifact = run.use_artifact('capecape/alpaca_ft/packed_alpaca:v0', type='dataset')
artifact_dir = Path(artifact.download())
```
As we kept the dataset as plain JSON files, we can open them directly using the Python built-in json module:
```python
import json

def load_jsonl(filename):
    data = []
    with open(filename, 'r') as file:
        for line in file:
            data.append(json.loads(line))
    return data

train_ds_packed = load_jsonl(artifact_dir/"train_packed_alpaca.jsonl")
eval_ds_packed = load_jsonl(artifact_dir/"eval_packed_alpaca.jsonl")
```
From there, we can continue our training!
Loading Local JSON Data from Disk Using HuggingFace Datasets
A better container for your dataset than plain JSON might be something like the Hugging Face datasets library. This has many advantages, such as fast loading, built-in map/filter methods, and bucket streaming, among others. You can quickly convert the jsonl files we created to datasets format by using the load_from_disk method:
```python
import wandb
from datasets import load_from_disk  # for some reason load_dataset gives an error

run = wandb.init(project="alpaca_ft")
artifact = run.use_artifact('capecape/alpaca_ft/packed_alpaca_hf:v0', type='dataset')
artifact_dir = artifact.download()

ds_packed = load_from_disk(artifact_dir)

# we are back where we started!
train_ds_packed = ds_packed["train"]
eval_ds_packed = ds_packed["eval"]
```
DataLoader
As we are training for completion, the labels (or targets) will be the inputs shifted by one. We will train with regular cross-entropy and predict the next token on this packed dataset.
As input and target are the same sequence but shifted, we lose one token at each end
In code, we accomplish this by setting the labels as the inputs shifted by one:
```python
{"input_ids": input_ids[:-1], "labels": input_ids[1:]}  # you actually drop one value
```
Beware that Hugging Face models do this shift for you automatically when computing the loss on the model output (`ModelOutput.loss`); in that case, inputs and labels should be identical.
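As a toy illustration in plain Python (toy token ids, not the actual dataloader code), the shift looks like this:

```python
input_ids = [1, 13866, 338, 385, 15278]  # a toy packed sequence of token ids

# the model sees all tokens but the last; the targets are all tokens but the first
example = {"input_ids": input_ids[:-1], "labels": input_ids[1:]}

# both sequences are one token shorter than the packed sequence
assert len(example["input_ids"]) == len(input_ids) - 1
# each label is the token that follows the corresponding input position
assert example["labels"][0] == input_ids[1]
```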
```python
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 8  # I have an A100 GPU with 40GB of RAM 😎

train_dataloader = DataLoader(
    train_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator,  # we don't need any special collator 😎
)

eval_dataloader = DataLoader(
    eval_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator,
    shuffle=False,
)
```
It's always a good idea to check what a batch looks like. You can quickly do this by sampling from the DataLoader:
```python
b = next(iter(train_dataloader))
b.keys(), b["input_ids"][0][:25], b["labels"][0][:25]

>> (dict_keys(['input_ids', 'labels']),
    tensor([    1, 13866,   338,   385, 15278,   393, 16612,   263,  3414, 29889,
            14350,   263,  2933,   393,  7128,  2486,  1614,  2167,   278,  2009,
            29889,    13,    13,  2277, 29937]),
    tensor([13866,   338,   385, 15278,   393, 16612,   263,  3414, 29889, 14350,
              263,  2933,   393,  7128,  2486,  1614,  2167,   278,  2009, 29889,
               13,    13,  2277, 29937,  2799]))  # <<< ---- shifted by 1

# input_ids.shape: (16, 1024), labels.shape: (16, 1024)
```
Everything looks fine; let's train this thing!
Training Loop
We'll start by training a model, naively making the model complete the sentence. As an exercise, I will implement this in pure PyTorch, so no abstractions are present besides grabbing the pre-trained model from the HuggingFace Hub.
I like storing the configuration hyperparameters in a SimpleNamespace. It's like a dictionary with .dot attribute access. Then, I can access my batch size by doing config.batch_size instead of config["batch_size"].
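For example:

```python
from types import SimpleNamespace

config = SimpleNamespace(batch_size=8, lr=2e-4)

print(config.batch_size)  # dot access instead of config["batch_size"]
config.epochs = 3         # attributes can be added and mutated freely
```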
We will use some necessary tricks to make this possible:
- We're going to train a subset of the model parameters instead of the full model.
- We're going to use gradient checkpointing to save GPU memory. Checkpointing is a method that mitigates memory usage by discarding and recomputing certain layers' activations during the backward pass, trading additional computation time for decreased memory usage.
- Automatic Mixed Precision: this technique makes training considerably faster, as the computations are done in half precision (float16 or bfloat16). You can read more about this technique here.
- We will implement an evaluation step that samples from the model regularly.
Let's get started!
```python
from types import SimpleNamespace

gradient_accumulation_steps = 32 // batch_size

config = SimpleNamespace(
    model_id='meta-llama/Llama-2-7b-hf',
    dataset_name="alpaca-gpt4",
    precision="bf16",  # faster and better than fp16, requires new GPUs
    n_freeze=24,  # how many layers we don't train; Llama 7B has 32
    lr=2e-4,
    n_eval_samples=10,  # how many samples to generate on validation
    max_seq_len=max_sequence_len,  # length of the sequences to pack
    epochs=3,  # we do 3 passes over the dataset
    gradient_accumulation_steps=gradient_accumulation_steps,  # every how many iterations we update the gradients, simulates larger batch sizes
    batch_size=batch_size,  # what my GPU can handle, depends on how many layers we are training
    log_model=False,  # upload the model to W&B?
    mom=0.9,  # optim param
    gradient_checkpointing=True,  # saves even more memory
    freeze_embed=True,  # why train this? let's keep them frozen ❄️
)

config.total_train_steps = config.epochs * len(train_dataloader) // config.gradient_accumulation_steps
```
We first get a pre-trained model with some configuration parameters:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    config.model_id,
    device_map=0,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    use_cache=False,
)
```
Freezing the Model to Save Memory: 🥶 Jeremy Howard Style
Training the full model is expensive, but if you have a GPU that can fit the entire model, you can skip this part. Instead, we will train a subset of the model parameters. This technique has worked in other domains and was pioneered by Jeremy Howard and Sebastian Ruder.
Transformer-based models like Llama are a stack of identical layers on top of each other with a classification layer at the end. Llama 2-7b has 32 transformer layers, so we will only train the last 8 of them. You can experiment with how many layers to freeze. You always want to train the classification head (the last layer that makes the predictions).
In the rest of this piece, we'll explore how one can train the full model leveraging parameter-efficient fine-tuning techniques like LoRA.
This technique has proven to work well in most cases; try it!
Before trying fancy parameter-efficient methods, let's go Jeremy style and freeze most model layers. After loading the model, we freeze most of it. This way, we save a ton of memory by not computing gradients on the frozen layers.
```python
n_freeze = 24  # you can play with this parameter

# freeze layers (disable gradients)
for param in model.parameters():
    param.requires_grad = False
for param in model.lm_head.parameters():
    param.requires_grad = True
for param in model.model.layers[n_freeze:].parameters():
    param.requires_grad = True

>> Total params: 6738.42M, Trainable: 1750.14M
```
You can even gain a little bit more memory by freezing the embeddings!
```python
# just freeze embeddings for a small memory decrease
if config.freeze_embed:
    model.model.embed_tokens.weight.requires_grad_(False)
```
You can also use gradient checkpointing to save even more memory (this makes training slower; how much depends on your particular configuration). There is an excellent article on the Hugging Face website about how to fit large models in memory; I encourage you to check it out!
```python
# save more memory
if config.gradient_checkpointing:
    model.gradient_checkpointing_enable(
        gradient_checkpointing_kwargs={"use_reentrant": False}  # <- PyTorch changed this
    )
```
💡
Had to fix this cell, as the "use_reentrant" argument is needed to make gradients flow from the frozen embeddings!
💡
Optimizer and Scheduler
We'll now set up the optimizer and scheduler for our training. We need these to tell PyTorch how to compute the optimization step and adjust the learning rate accordingly. There are probably fancier techniques to try, but Adam and a cosine schedule are safe starting points. We will also set up our training loop to use bfloat16, to make good use of the TensorCores available on modern Nvidia GPUs, and define loss_fn as cross entropy.
```python
from transformers import get_cosine_schedule_with_warmup

optim = torch.optim.Adam(model.parameters(), lr=config.lr, betas=(0.9, 0.99), eps=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optim,
    num_training_steps=config.total_train_steps,
    num_warmup_steps=config.total_train_steps // 10,
)

def loss_fn(x, y):
    "A flat CrossEntropy"
    return torch.nn.functional.cross_entropy(x.view(-1, x.shape[-1]), y.view(-1))
```
We grab the scheduler from the transformers library; why not? It's already there waiting for us. You can implement the scheduler Karpathy-style if you like.
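If you did want to roll your own, the shape of the schedule fits in a few lines. This is a sketch of linear warmup followed by cosine decay, an approximation of what get_cosine_schedule_with_warmup computes, not its exact implementation:

```python
import math

def cosine_lr(step, total_steps, warmup_steps, max_lr):
    "Linear warmup to max_lr, then cosine decay down to zero."
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

peak = cosine_lr(10, 100, 10, 2e-4)  # right at the end of warmup -> max_lr
```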
Sampling from the Model
We are almost there! Let's create a simple function to sample from the model now and then to visually see what the model is outputting.
Let's wrap the model.generate method for simplicity. You can grab the default sampling parameters from the GenerationConfig and pass the corresponding model_id. This will hold the defaults for parameters like temperature, top p, etc.
```python
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained(config.model_id)

def generate(prompt, max_new_tokens=100, gen_config=gen_config):
    with torch.inference_mode():
        tokenized_prompt = tokenizer(prompt, return_tensors='pt')['input_ids'].cuda()
        output = model.generate(tokenized_prompt,
                                max_new_tokens=max_new_tokens,
                                generation_config=gen_config)
    return tokenizer.decode(output[0][len(tokenized_prompt[0]):], skip_special_tokens=True)
```
We'll run our model over the eval_dataset every 1/10th of the total train steps and log a table to Weights & Biases containing the model predictions. We will also add the relevant sampling parameters in case we change them later on.
```python
def prompt_table(prompts, log=True):
    table = wandb.Table(columns=["prompt", "generation", "concat",
                                 "max_new_tokens", "temperature", "top_p"])
    for prompt in progress_bar(prompts):
        out = generate(prompt, test_config.max_new_tokens, test_config.gen_config)
        table.add_data(prompt, out, prompt + out,
                       test_config.max_new_tokens,
                       test_config.gen_config.temperature,
                       test_config.gen_config.top_p)
    if log:
        wandb.log({"predictions": table})
    return table
```
Validation Step
You should always run some validation during training. You may skip this if the training run is very short, but computing metrics on a validation dataset can give you precious insight into how the training is going. For LLMs, you also want to sample from the model to visualize how the alignment with your data is progressing. We implement a validate function that does a couple of things:
- Iterates through the eval_dataloader and accumulates loss and accuracy
- Logs those metrics to W&B over the entire dataset
- Samples from the model and logs the generations to a W&B Table
```python
@torch.no_grad()
def validate():
    model.eval()
    eval_acc = Accuracy()
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = to_gpu(batch)
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss = loss_fn(out.logits, batch["labels"])  # you could use out.loss and not shift the dataset
        eval_acc.update(out.logits, batch["labels"])
    # we log results at the end
    wandb.log({"eval_loss": loss.item(),
               "eval_accuracy": eval_acc.compute()})
    prompt_table(eval_dataset[:config.n_eval_samples], log=True)
    model.train()
```
It's a good idea to run validation after some steps to assess that everything is going okay, as this is a short fine-tuning. You want to call validate at least a couple of times during training; this will depend on the task and the dataset size. For this experiment, we will perform validation 3 times (at the end of every epoch).
A Simple PyTorch Training Loop for Your LLM
This PyTorch training loop is a standard loop that iterates through the train data loader and performs evaluation every fixed number of steps. It saves the model at the end of training. A few things worth noting:
- Gradient accumulation: this technique enables us to simulate larger batch sizes, which is very useful on GPUs with less memory.
- Sampling and model checkpoint saving (this trains very fast, so there is no need to save multiple checkpoints).
- Compute token accuracy: it is a better metric than loss because it is easy to understand; the accuracy number represents a quantity we can interpret. Also, let's not forget that this is a classification task over the next-token prediction! If you don't believe me, Jeremy Howard still suggests accuracy as the metric to go with for causal language modeling.
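Both the loop below and the validate function above rely on an Accuracy helper that isn't shown in this article. A minimal sketch of the token-accuracy logic (in plain Python on already-argmaxed token ids; the real helper works on logits and labels tensors) might look like:

```python
class Accuracy:
    "A minimal running token-accuracy tracker (a sketch, not the helper used in the actual run)."

    def __init__(self):
        self.correct, self.total = 0, 0

    def update(self, preds, labels):
        # preds, labels: flat sequences of token ids (predictions already argmaxed)
        self.correct += sum(int(p == l) for p, l in zip(preds, labels))
        self.total += len(labels)
        return self.compute()

    def compute(self):
        return self.correct / self.total

acc = Accuracy()
acc.update([13866, 338, 385], [13866, 338, 999])  # 2 of the 3 predictions match
```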
```python
wandb.init(project="alpaca_ft",  # the project I am working on
           tags=["baseline", "7b"],
           job_type="train",
           config=config)  # the hyperparameters I want to keep track of

# Training
acc = Accuracy()
model.train()
train_step = 0
pbar = tqdm(total=config.total_train_steps)
for epoch in range(config.epochs):
    for step, batch in enumerate(train_dataloader):
        batch = to_gpu(batch)
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss = loss_fn(out.logits, batch["labels"]) / config.gradient_accumulation_steps  # you could use out.loss and not shift the dataset
        loss.backward()
        if step % config.gradient_accumulation_steps == 0:
            # we can log the metrics to W&B
            wandb.log({"train/loss": loss.item() * config.gradient_accumulation_steps,
                       "train/accuracy": acc.update(out.logits, batch["labels"]),
                       "train/learning_rate": scheduler.get_last_lr()[0],
                       "train/global_step": train_step})
            optim.step()
            scheduler.step()
            optim.zero_grad(set_to_none=True)
            train_step += 1
            pbar.update(1)
    validate()
pbar.close()

# we save the model checkpoint at the end
save_model(model,
           model_name=config.model_id.replace("/", "_"),
           models_folder="models/", log=config.log_model)
wandb.finish()
```
This trains in around 120 minutes on an A100.
The Hugging Face course has a training loop similar to this one that uses pure PyTorch to train a model from the HF hub.
💡
Results
We present the loss curves and the accuracy metrics. Our total training steps are around 1150 steps (3 epochs) with gradient accumulation steps = 4. We pass two samples before updating the gradients.
Our results:
We can inspect the model generations from the table below! At first glance, the results look reasonable! Let's manually check a fixed row. The input is "Generate a list of 5 potential threats to digital security," and we can see on the generation column the evolution of the generations over time by clicking on the arrows at the bottom of the cell:
These were samples generated and computed during training. We'll evaluate them on the entire test dataset in the next section.
GPT-4 based evaluation
Let's use GPT-4 to compare the results generated by the fine-tuned model against GPT-3.5, and also get GPT-4's reason for picking one over the other. This evaluation technique has been used in multiple places, for instance, MT-Bench (a set of challenging multi-turn open-ended questions for evaluating chat assistants).
You can read more about LLM supervised evaluation in Ayush's article.
💡
GPT-4 is better at reasoning than GPT-3.5. Also, it wouldn't be fair to use the same model that generated one of the responses to judge itself. Of course, this technique is not perfect, and other studies have shown that this evaluation strategy may not be consistent under permutation (switching the answers) or even across multiple calls to the model, which can lead to different responses due to the stochastic nature of the generations. One way to mitigate this is setting the temperature sampling parameter closer to zero to make the model more deterministic.
The clear win of this approach is that one can quickly implement LLM-based evaluation, and using a powerful model like GPT-4, we can create a baseline score quickly. Ideally, you would want to set up a human-based assessment at some point, but this is more costly and slower to implement.
We can leverage OpenAI function calling to format the output of GPT-4 with the corresponding choice made and the reason.
```python
def gpt4_judge(instruction, gen1, gen2, model="gpt-4"):
    system_prompt = ("You will be presented with a choice of two possible responses for an instruction. "
                     "You have to pick the best one and give a reason why.\n"
                     "The response should follow the instructions and use the provided context if there is some.\n"
                     "If both answers are equivalent, pick the value 0")
    message = "{instruction}\n Answer 1: \n{gen1}\n Answer 2:\n{gen2}".format(
        instruction=instruction, gen1=gen1, gen2=gen2)
    completion = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": message},
        ],
        function_call={"name": "make_choice"},
        functions=[
            {
                "name": "make_choice",
                "description": "Select the best generation and explain why",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "choice": {
                            "type": "integer",
                            "description": "the chosen alternative, zero if equivalent",
                        },
                        "argument": {
                            "type": "string",
                            "description": "reason why the choice was made",
                        },
                    },
                },
                "required": ["choice", "argument"],
            },
        ],
    )
    return completion
```
You can inspect the results in the evaluation tables below. We generated 250 completions using GPT-3.5 and asked GPT-4 to pick the best one; we also left the possibility of marking both as equally good:
- Both models are good
- The fine-tuned Llama was better
- GPT-3.5 produced better output
To make our testing more robust, we inverted the order and asked GPT-4 again, and we only kept the choices where GPT-4 consistently picked the same answer regardless of the order. To our surprise, GPT-4 switched sides 34 times! So take this evaluation with a grain of salt!
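The order-consistency filter itself is simple to sketch (a hypothetical helper in plain Python): after inverting the presentation order, answer 1 becomes answer 2 and vice versa, so we keep a judgment only when both passes point at the same underlying answer:

```python
def consistent_choices(judgments):
    "judgments: list of (choice_in_original_order, choice_in_inverted_order); choices are 1, 2, or 0 (tie)."
    kept = []
    for first, second in judgments:
        # undo the inversion: in the second pass, answer 1 was shown as answer 2 and vice versa
        undone = {1: 2, 2: 1, 0: 0}[second]
        if first == undone:
            kept.append(first)
    return kept

kept = consistent_choices([(1, 2), (2, 2), (0, 0)])  # the middle judgment flipped sides
```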
💡
Order matters: inverting the query order makes GPT-4 switch sides (sometimes)
You can check GPT-4 inconsistency here:
- Occasionally it prefers short answers, then switches sides and values the explanation and the longer answer 🤔
- In some cases, it judges both equally good and, when the order is inverted, prefers one.
- Check the answers below by clicking the < > 👇
Evaluation Results
You can also check why GPT-4 made each pick in the "argument" column. The fine-tuned Llama is good, but not nearly as good as GPT-3.5. This makes sense: how would a 7B model trained on a handful of GPT-4 generations beat a probably much bigger model like GPT-3.5? Still, other questions arise:
- Is the 7B model too small? If we switched to Llama 2-13b, would the outcome be the same?
- Should we train more layers of the model? All layers?

We will explore some of these questions in the following articles.
Conclusion and Final Remarks
Fine-tuning a model on an instruction dataset is just a particular case of completion training, where one constructs the dataset in an organized way so it can learn to follow instructions. This is a small example to demystify the complexity of what's happening under the hood when using specialized libraries to fine-tune.
Of course, Llama 7B is the smallest of the models out there, and one may obtain better results using the bigger brothers, but we managed to give instruction capabilities to a pre-trained model that did not have them. Now the model replies most of the time in the format specified and generates reasonable answers.
GPT-4 tends to prefer GPT-3.5... "GPTs prefer GPTs. — Ayush T." 🤣
This is the first of two articles about Instruction tuning. In the following piece, we will train the model by using the Hugging Face ecosystem and W&B integration. This will significantly simplify the preprocessing and code one must write.